Introduction to R

Python and R have a lot of similarities in the way they operate, but there are some slight differences.

Remember how I said computer languages are a lot like spoken languages? Well, you’re about to become bilingual.

Depending on which you’re most comfortable with, you may think about how to say something in your primary language and then translate it into your secondary/tertiary language.

Python and R are kind of like Italian and Spanish - they’re different, but if you know one really well, learning the other is not super hard. It will take work to master, of course, but the translation is similar enough that you can figure it out quickly if you can find the right words you need to use.

A great resource listed in the syllabus is R for Data Science, freely available online here - I rely heavily on it in this lecture. Another good option, YaRrr! The Pirate’s Guide to R, is here

CRAN & Mirrors

R code and packages are stored on a series of servers across the world through the Comprehensive R Archive Network (CRAN) - each server (a “mirror”) holds an identical copy of the same information

So the first thing you’ll likely need to do is choose a mirror near you - a list of mirrors can be found here

Once you set it, it’s done. There’s no need to revisit this step unless you want to pull from a different server (for example, one closer to you if you move)

options("repos" = c(CRAN = "http://lib.stat.cmu.edu/R/CRAN/"))

We load packages with library(), like so:

library(package_name)

You have to call the package that you want in your notebook. In a moment, we’re going to start working with tidyverse and dplyr. But, just to see the importing of a package, let’s try it here.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

If you want to install a package, you can do so like this:

install.packages('dplyr')

Or you can install through the RStudio package browser - I’ll show you that now.

Basic Commands

Before we get to working with data, let’s do a quick overview of some basic commands

To change the working directory we use:

setwd('/your/working/directory/')

To find out what your current working directory is, we use:

getwd()

getwd()
## [1] "/Users/mkaltenberg/Documents/GitHub/Data_Analysis_Python_R/New R Kids on the Block - Intro to R"
#setwd('/Users/mkaltenberg/Documents/GitHub/Data_Analysis_Python_R/New R Kids on the Block/')

Assignments

In R documentation they almost always use <- but = also works.

x <- 'hi there!'
x
## [1] "hi there!"
y= 'bye'

You can change a variable’s value just as easily by reassigning it.

(y <- seq(0, 10, 2))
## [1]  0  2  4  6  8 10

Reserved Words

Like in Python, some words are best never used as names (so you don’t override core functions in R)

A full list can be found here
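As a quick illustration (using only base R, with a made-up variable for demonstration): truly reserved words can’t be assigned at all, while ordinary base function names can be shadowed - which is exactly why it’s best to avoid them:

```r
# Truly reserved words cannot be reassigned:
# TRUE <- 1   # Error: invalid assignment (don't run)

# But base function names CAN be shadowed, which invites confusing bugs:
c <- 5        # the variable c now shadows the combine function c()
x <- c(1, 2)  # still works (R looks for a *function* named c), but it's risky
rm(c)         # remove the variable to get rid of the shadow
```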

Import/Export Data

To read a csv file you use

read.csv(path_to_csv_file)

To save a csv file you use write.csv(df, path_to_csv_file)

#Don't forget to assign it!
jobs_r <- read.csv('job-automation-probability.csv')

Ok, now you. Go onto classes and download the file and import into R Studio now. If you have an error or other issues, share your screen and I can help you.

Exporting data is also easy:

write.csv(jobs_r, 'jobs2.csv')

Help function

you can always ask for documentation, but that function is: help()

help(seq)
help('read.csv')

For packages, there is also a summary about the package and what it does with vignette

vignette('dplyr')
## starting httpd help server ... done

Removing objects

Objects can clog up our RAM, especially if they are large data files. If you want to remove an object, the function is rm()

rm(x,y)

You can also remove EVERYTHING in your environment with

rm(list = ls())

Just remember that this deletes everything, and you’d have to import all of your data again

Readr

A faster way to import data is the package readr - this becomes important with larger datasets, where you want to read data into memory efficiently.

library(readr)
jobs = read_csv('job-automation-probability.csv')
## Rows: 702 Columns: 13── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): _ - code, education, occupation, short occupation, employed_may2016
## dbl (8): _ - rank, prob, Average annual wage, len, probability, numbEmployed...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# IF you want to specify the delimiter  
jobs = read_delim('job-automation-probability.csv',  delim = ',')
## Rows: 702 Columns: 13── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): _ - code, education, occupation, short occupation, employed_may2016
## dbl (8): _ - rank, prob, Average annual wage, len, probability, numbEmployed...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You can compare the two ways we imported the dataset - there are differences in how they handle column names:

#read.csv
jobs_r = read.csv('job-automation-probability.csv')
names(jobs_r)
##  [1] "X_...rank"           "X_...code"           "prob"               
##  [4] "Average.annual.wage" "education"           "occupation"         
##  [7] "short.occupation"    "len"                 "probability"        
## [10] "numbEmployed"        "median_ann_wage"     "employed_may2016"   
## [13] "average_ann_wage"
#read_csv from readr
names(jobs)
##  [1] "_ - rank"            "_ - code"            "prob"               
##  [4] "Average annual wage" "education"           "occupation"         
##  [7] "short occupation"    "len"                 "probability"        
## [10] "numbEmployed"        "median_ann_wage"     "employed_may2016"   
## [13] "average_ann_wage"

You can also import a variety of other formats, like Stata, with the haven package.

library(haven)
SAS
#read_sas("mtcars.sas7bdat")
#write_sas(mtcars, "mtcars.sas7bdat")
SPSS
#read_sav("mtcars.sav")
#write_sav(mtcars, "mtcars.sav")
Stata
#read_dta("mtcars.dta")
#write_dta(mtcars, "mtcars.dta")

Data Manipulation

I’ll show some of the commands we used in Python.

Generally, you’re going to tell R (1) which dataframe you are manipulating and (2) which function you want to apply.

library(dplyr)

Columns

Getting Column Names

To get the names of the columns in the data frame

python jobs.columns

R names(df)

names(jobs)
##  [1] "_ - rank"            "_ - code"            "prob"               
##  [4] "Average annual wage" "education"           "occupation"         
##  [7] "short occupation"    "len"                 "probability"        
## [10] "numbEmployed"        "median_ann_wage"     "employed_may2016"   
## [13] "average_ann_wage"

Select some of the columns:

python

jobs[['_ - code', 'prob', 'Average.annual.wage', 'education', 'numbEmployed']]

R

jobs %>% select(c('_ - code', 'prob', 'Average annual wage', 
                  'education', 'numbEmployed'))
## # A tibble: 702 × 5
##    `_ - code`  prob `Average annual wage` education                 numbEmployed
##    <chr>      <dbl>                 <dbl> <chr>                            <dbl>
##  1 51-4033    0.95                  34920 High school diploma or e…        74600
##  2 51-9012    0.88                  41450 High school diploma or e…        47160
##  3 41-4012    0.85                  68410 High school diploma or e…      1404050
##  4 53-1031    0.029                 59800 High school diploma or e…       202760
##  5 51-4072    0.95                  32660 High school diploma or e…       145560
##  6 51-6091    0.88                  35420 High school diploma or e…        19340
##  7 51-4031    0.78                  34210 High school diploma or e…       192800
##  8 41-4011    0.25                  92910 Bachelor's degree               328370
##  9 51-4032    0.94                  38880 High school diploma or e…        12290
## 10 51-9041    0.93                  34370 High school diploma or e…        71260
## # ℹ 692 more rows

A more simplified way to do this in R

select(jobs, c('_ - code', 'prob', 'Average annual wage', 
               'education', 'numbEmployed'))
## # A tibble: 702 × 5
##    `_ - code`  prob `Average annual wage` education                 numbEmployed
##    <chr>      <dbl>                 <dbl> <chr>                            <dbl>
##  1 51-4033    0.95                  34920 High school diploma or e…        74600
##  2 51-9012    0.88                  41450 High school diploma or e…        47160
##  3 41-4012    0.85                  68410 High school diploma or e…      1404050
##  4 53-1031    0.029                 59800 High school diploma or e…       202760
##  5 51-4072    0.95                  32660 High school diploma or e…       145560
##  6 51-6091    0.88                  35420 High school diploma or e…        19340
##  7 51-4031    0.78                  34210 High school diploma or e…       192800
##  8 41-4011    0.25                  92910 Bachelor's degree               328370
##  9 51-4032    0.94                  38880 High school diploma or e…        12290
## 10 51-9041    0.93                  34370 High school diploma or e…        71260
## # ℹ 692 more rows

Multiple Syntax in R

There are multiple ways of calling a dataframe and applying a function.

The first way is df %>% select(c(col_names))

So, this funny thing %>% (called a pipe) is saying that I am going to work with the dataframe named df and I want you to apply a function called select.

I find it much more intuitive to use select(df, c(column_names))

Where, I have a function called select and I’m telling it that the dataframe name df is what I will apply the function select to.

Because I have a preference, we’ll stick to the latter form in the rest of the lecture - but when you look at Stack Overflow and get confused by the other notation, recall that it’s the same-ish.
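To convince yourself that the two forms really are equivalent, here is a quick check - sketched on the built-in mtcars data (used only because it ships with R, so no import is needed; dplyr is assumed to be installed):

```r
library(dplyr)

# The piped form and the direct form produce exactly the same result:
piped  <- mtcars %>% select(c('mpg', 'cyl'))
direct <- select(mtcars, c('mpg', 'cyl'))
identical(piped, direct)  # TRUE
```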

Dropping columns

You just include a negative sign before the column list, and that will drop the columns you listed:

names(jobs)
##  [1] "_ - rank"            "_ - code"            "prob"               
##  [4] "Average annual wage" "education"           "occupation"         
##  [7] "short occupation"    "len"                 "probability"        
## [10] "numbEmployed"        "median_ann_wage"     "employed_may2016"   
## [13] "average_ann_wage"
select(jobs, -c('probability','_ - rank','employed_may2016' ,'average_ann_wage','len'))
## # A tibble: 702 × 8
##    `_ - code`  prob `Average annual wage` education                   occupation
##    <chr>      <dbl>                 <dbl> <chr>                       <chr>     
##  1 51-4033    0.95                  34920 High school diploma or equ… Grinding,…
##  2 51-9012    0.88                  41450 High school diploma or equ… Separatin…
##  3 41-4012    0.85                  68410 High school diploma or equ… Sales Rep…
##  4 53-1031    0.029                 59800 High school diploma or equ… First-Lin…
##  5 51-4072    0.95                  32660 High school diploma or equ… Molding, …
##  6 51-6091    0.88                  35420 High school diploma or equ… Extruding…
##  7 51-4031    0.78                  34210 High school diploma or equ… Cutting, …
##  8 41-4011    0.25                  92910 Bachelor's degree           Sales Rep…
##  9 51-4032    0.94                  38880 High school diploma or equ… Drilling …
## 10 51-9041    0.93                  34370 High school diploma or equ… Extruding…
## # ℹ 692 more rows
## # ℹ 3 more variables: `short occupation` <chr>, numbEmployed <dbl>,
## #   median_ann_wage <dbl>

Filtering

The same boolean operators you used in Python (and most languages) work the same way here.

This handy chart can help you figure out what boolean operators you want to use

python

jobs[jobs['prob'] > .8]

R

filter(jobs, prob >.8)
## # A tibble: 262 × 13
##    `_ - rank` `_ - code`  prob `Average annual wage` education        occupation
##         <dbl> <chr>      <dbl>                 <dbl> <chr>            <chr>     
##  1        624 51-4033     0.95                 34920 High school dip… Grinding,…
##  2        517 51-9012     0.88                 41450 High school dip… Separatin…
##  3        484 41-4012     0.85                 68410 High school dip… Sales Rep…
##  4        620 51-4072     0.95                 32660 High school dip… Molding, …
##  5        518 51-6091     0.88                 35420 High school dip… Extruding…
##  6        590 51-4032     0.94                 38880 High school dip… Drilling …
##  7        584 51-9041     0.93                 34370 High school dip… Extruding…
##  8        477 51-4034     0.84                 39630 High school dip… Lathe and…
##  9        560 51-4021     0.91                 35340 High school dip… Extruding…
## 10        637 51-6064     0.96                 28110 High school dip… TextileWi…
## # ℹ 252 more rows
## # ℹ 7 more variables: `short occupation` <chr>, len <dbl>, probability <dbl>,
## #   numbEmployed <dbl>, median_ann_wage <dbl>, employed_may2016 <chr>,
## #   average_ann_wage <dbl>

Python jobs[jobs['education'] == 'High school diploma or equivalent']

R filter(df, column == 'value')

filter(jobs, education == 'High school diploma or equivalent')
## # A tibble: 307 × 13
##    `_ - rank` `_ - code`  prob `Average annual wage` education        occupation
##         <dbl> <chr>      <dbl>                 <dbl> <chr>            <chr>     
##  1        624 51-4033    0.95                  34920 High school dip… Grinding,…
##  2        517 51-9012    0.88                  41450 High school dip… Separatin…
##  3        484 41-4012    0.85                  68410 High school dip… Sales Rep…
##  4        105 53-1031    0.029                 59800 High school dip… First-Lin…
##  5        620 51-4072    0.95                  32660 High school dip… Molding, …
##  6        518 51-6091    0.88                  35420 High school dip… Extruding…
##  7        427 51-4031    0.78                  34210 High school dip… Cutting, …
##  8        590 51-4032    0.94                  38880 High school dip… Drilling …
##  9        584 51-9041    0.93                  34370 High school dip… Extruding…
## 10        477 51-4034    0.84                  39630 High school dip… Lathe and…
## # ℹ 297 more rows
## # ℹ 7 more variables: `short occupation` <chr>, len <dbl>, probability <dbl>,
## #   numbEmployed <dbl>, median_ann_wage <dbl>, employed_may2016 <chr>,
## #   average_ann_wage <dbl>

In pandas, we often used a tilde (~) to exclude something. In R, you use an exclamation mark (!)

python jobs[~((jobs['education'] == 'High school diploma or equivalent') | (jobs['education'] == 'No formal educational credential'))]

R filter(df, !(column == 'value' | column == 'value'))

filter(jobs, !(education == 'High school diploma or equivalent' 
               | education =='No formal educational credential'))
## # A tibble: 297 × 13
##    `_ - rank` `_ - code`   prob `Average annual wage` education       occupation
##         <dbl> <chr>       <dbl>                 <dbl> <chr>           <chr>     
##  1        228 41-4011    0.25                   92910 Bachelor's deg… Sales Rep…
##  2        554 49-2093    0.91                   59840 Postsecondary … Electrica…
##  3        208 15-1179    0.21                   67770 Associate's de… Informati…
##  4        254 49-2022    0.36                   54520 Postsecondary … Telecommu…
##  5        103 17-2111    0.028                  90190 Bachelor's deg… Health an…
##  6        205 25-3011    0.19                   55140 Bachelor's deg… Adult Bas…
##  7        277 49-2094    0.41                   56990 Postsecondary … Electrica…
##  8         41 25-2031    0.0078                 61420 Bachelor's deg… Secondary…
##  9        261 49-2095    0.38                   74540 Postsecondary … Electrica…
## 10        200 25-2022    0.17                   59800 Bachelor's deg… Middle Sc…
## # ℹ 287 more rows
## # ℹ 7 more variables: `short occupation` <chr>, len <dbl>, probability <dbl>,
## #   numbEmployed <dbl>, median_ann_wage <dbl>, employed_may2016 <chr>,
## #   average_ann_wage <dbl>

And our good friend, is.na():

python
jobs[jobs['prob'].isnull()]

Note: in R, missing values are written as NA

filter(jobs, is.na(prob))
## # A tibble: 0 × 13
## # ℹ 13 variables: _ - rank <dbl>, _ - code <chr>, prob <dbl>,
## #   Average annual wage <dbl>, education <chr>, occupation <chr>,
## #   short occupation <chr>, len <dbl>, probability <dbl>, numbEmployed <dbl>,
## #   median_ann_wage <dbl>, employed_may2016 <chr>, average_ann_wage <dbl>

and to drop na items

python jobs['prob'].dropna()

R filter(df, !is.na(column))

filter(jobs, !is.na(prob))
## # A tibble: 702 × 13
##    `_ - rank` `_ - code`  prob `Average annual wage` education        occupation
##         <dbl> <chr>      <dbl>                 <dbl> <chr>            <chr>     
##  1        624 51-4033    0.95                  34920 High school dip… Grinding,…
##  2        517 51-9012    0.88                  41450 High school dip… Separatin…
##  3        484 41-4012    0.85                  68410 High school dip… Sales Rep…
##  4        105 53-1031    0.029                 59800 High school dip… First-Lin…
##  5        620 51-4072    0.95                  32660 High school dip… Molding, …
##  6        518 51-6091    0.88                  35420 High school dip… Extruding…
##  7        427 51-4031    0.78                  34210 High school dip… Cutting, …
##  8        228 41-4011    0.25                  92910 Bachelor's degr… Sales Rep…
##  9        590 51-4032    0.94                  38880 High school dip… Drilling …
## 10        584 51-9041    0.93                  34370 High school dip… Extruding…
## # ℹ 692 more rows
## # ℹ 7 more variables: `short occupation` <chr>, len <dbl>, probability <dbl>,
## #   numbEmployed <dbl>, median_ann_wage <dbl>, employed_may2016 <chr>,
## #   average_ann_wage <dbl>

And you can rename variables as you select them:

python jobs[['_ - code', '_ - rank']].rename(columns={'_ - code': 'code', '_ - rank': 'rank'})

R select(df, new_name = old_name)

select(jobs, code= '_ - code' , '_ - rank')
## # A tibble: 702 × 2
##    code    `_ - rank`
##    <chr>        <dbl>
##  1 51-4033        624
##  2 51-9012        517
##  3 41-4012        484
##  4 53-1031        105
##  5 51-4072        620
##  6 51-6091        518
##  7 51-4031        427
##  8 41-4011        228
##  9 51-4032        590
## 10 51-9041        584
## # ℹ 692 more rows

To rename just selected columns but keep the whole dataframe:

python jobs.rename(columns={'_ - code': 'code', '_ - rank': 'rank'})

R rename(df, new_name=old_name)

rename(jobs, code='_ - code' , rank= '_ - rank')
## # A tibble: 702 × 13
##     rank code     prob `Average annual wage` education                occupation
##    <dbl> <chr>   <dbl>                 <dbl> <chr>                    <chr>     
##  1   624 51-4033 0.95                  34920 High school diploma or … Grinding,…
##  2   517 51-9012 0.88                  41450 High school diploma or … Separatin…
##  3   484 41-4012 0.85                  68410 High school diploma or … Sales Rep…
##  4   105 53-1031 0.029                 59800 High school diploma or … First-Lin…
##  5   620 51-4072 0.95                  32660 High school diploma or … Molding, …
##  6   518 51-6091 0.88                  35420 High school diploma or … Extruding…
##  7   427 51-4031 0.78                  34210 High school diploma or … Cutting, …
##  8   228 41-4011 0.25                  92910 Bachelor's degree        Sales Rep…
##  9   590 51-4032 0.94                  38880 High school diploma or … Drilling …
## 10   584 51-9041 0.93                  34370 High school diploma or … Extruding…
## # ℹ 692 more rows
## # ℹ 7 more variables: `short occupation` <chr>, len <dbl>, probability <dbl>,
## #   numbEmployed <dbl>, median_ann_wage <dbl>, employed_may2016 <chr>,
## #   average_ann_wage <dbl>

You can also select columns whose names contain a given string - contains() selects only the matching columns.

R select(df, contains('value'))

select(jobs, contains("X"))
## # A tibble: 702 × 0
names(jobs)
##  [1] "_ - rank"            "_ - code"            "prob"               
##  [4] "Average annual wage" "education"           "occupation"         
##  [7] "short occupation"    "len"                 "probability"        
## [10] "numbEmployed"        "median_ann_wage"     "employed_may2016"   
## [13] "average_ann_wage"

Don’t forget that you have to overwrite the object if you want to save your changes:

jobs <-rename(jobs, code= '_ - code' , rank= '_ - rank', avg_ann_wage = "Average annual wage")

Ordering/Sorting

We can sort values in a dataframe with the function, arrange()

It takes a data frame and a set of column names (or more complicated expressions) to order by.

Here, we sort by probability first; if there is a tie, education level breaks it.

python jobs.sort_values(['prob', 'education'], ascending=False)

r arrange(df, column_name)

arrange(jobs, prob,education)
## # A tibble: 702 × 13
##     rank code    prob avg_ann_wage education occupation `short occupation`   len
##    <dbl> <chr>  <dbl>        <dbl> <chr>     <chr>      <chr>              <dbl>
##  1     1 29-1… 0.0028        48190 Bachelor… Recreatio… Recreational Ther…    23
##  2     3 11-9… 0.003         78060 Bachelor… Emergency… Emergency Managem…    30
##  3     2 49-1… 0.003         66730 High sch… First-Lin… First-Line Superv…    61
##  4     4 21-1… 0.0031        47880 Bachelor… Mental He… Mental Health and…    48
##  5     5 29-1… 0.0033        79290 Doctoral… Audiologi… Audiologists          12
##  6     7 29-2… 0.0035        69920 Master's… Orthotist… Orthotists and Pr…    27
##  7     8 21-1… 0.0035        55510 Master's… Healthcar… Healthcare Social…    25
##  8     6 29-1… 0.0035        83730 Master's… Occupatio… Occupational Ther…    23
##  9     9 29-1… 0.0036       232870 Doctoral… Oral and … Oral and Maxillof…    31
## 10    10 33-1… 0.0036        77050 Postseco… First-Lin… First-Line Superv…    62
## # ℹ 692 more rows
## # ℹ 5 more variables: probability <dbl>, numbEmployed <dbl>,
## #   median_ann_wage <dbl>, employed_may2016 <chr>, average_ann_wage <dbl>

r arrange(df, desc(column_name))

arrange(jobs, desc(prob))
## # A tibble: 702 × 13
##     rank code    prob avg_ann_wage education occupation `short occupation`   len
##    <dbl> <chr>  <dbl>        <dbl> <chr>     <chr>      <chr>              <dbl>
##  1   694 51-91…  0.99        31740 High sch… Photograp… Photographic Proc…    61
##  2   701 23-20…  0.99        51490 High sch… Title Exa… Title Examiners, …    42
##  3   696 43-50…  0.99        44250 High sch… Cargo and… Cargo and Freight…    24
##  4   699 15-20…  0.99        58490 Bachelor… Mathemati… Mathematical Tech…    24
##  5   698 13-20…  0.99        75480 Bachelor… Insurance… Insurance Underwr…    22
##  6   692 25-40…  0.99        34780 Postseco… Library T… Library Technicia…    19
##  7   693 43-41…  0.99        36480 High sch… New Accou… New Accounts Cler…    19
##  8   691 43-90…  0.99        31640 High sch… Data Entr… Data Entry Keyers     17
##  9   697 49-90…  0.99        39720 High sch… Watch Rep… Watch Repairers       15
## 10   695 13-20…  0.99        45340 High sch… Tax Prepa… Tax Preparers         13
## # ℹ 692 more rows
## # ℹ 5 more variables: probability <dbl>, numbEmployed <dbl>,
## #   median_ann_wage <dbl>, employed_may2016 <chr>, average_ann_wage <dbl>

Mutate

You may want to add new columns that are functions of existing columns - the function for that is mutate()

mutate() always adds new columns at the end of your dataset.

You can create a whole variety of new variables, as in Python. Here are some useful tips:

Arithmetic operators: +, -, *, /, ^. These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length.

Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces.

Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.

Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). This is useful for regressions with time series.

Logical comparisons: <, <=, >, >=, !=, and ==. If you’re doing a complex sequence of logical operations, it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.

Ranking: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.

Here, we can see that the new variable diff is added on at the end

python jobs['diff'] = jobs['avg_ann_wage'] - jobs['median_ann_wage']

r

mutate(jobs,  
      diff = avg_ann_wage - median_ann_wage)
## # A tibble: 702 × 14
##     rank code    prob avg_ann_wage education occupation `short occupation`   len
##    <dbl> <chr>  <dbl>        <dbl> <chr>     <chr>      <chr>              <dbl>
##  1   624 51-40… 0.95         34920 High sch… Grinding,… Tool setters, ope…    35
##  2   517 51-90… 0.88         41450 High sch… Separatin… Tool setters, ope…    35
##  3   484 41-40… 0.85         68410 High sch… Sales Rep… Sales Representat…    92
##  4   105 53-10… 0.029        59800 High sch… First-Lin… Supervisors Trans…    26
##  5   620 51-40… 0.95         32660 High sch… Molding, … Molding, Coremaki…    89
##  6   518 51-60… 0.88         35420 High sch… Extruding… Extruding and For…    88
##  7   427 51-40… 0.78         34210 High sch… Cutting, … Cutting, Punching…    85
##  8   228 41-40… 0.25         92910 Bachelor… Sales Rep… Sales Representat…    85
##  9   590 51-40… 0.94         38880 High sch… Drilling … Drilling and Bori…    82
## 10   584 51-90… 0.93         34370 High sch… Extruding… Extruding, Formin…    82
## # ℹ 692 more rows
## # ℹ 6 more variables: probability <dbl>, numbEmployed <dbl>,
## #   median_ann_wage <dbl>, employed_may2016 <chr>, average_ann_wage <dbl>,
## #   diff <dbl>

If you only want to keep the new variables, use transmute()

python diff = jobs['avg_ann_wage'] - jobs['median_ann_wage']

transmute(jobs, 
      diff = avg_ann_wage - median_ann_wage)
## # A tibble: 702 × 1
##     diff
##    <dbl>
##  1  2030
##  2  3090
##  3 11270
##  4  2530
##  5  2180
##  6  1180
##  7  1840
##  8 13930
##  9  2470
## 10  1860
## # ℹ 692 more rows

You can combine mutate with boolean filters

jobs %>% filter(occupation %in% c('Economists')) %>% mutate(
      diff = avg_ann_wage - median_ann_wage)
## # A tibble: 1 × 14
##    rank code     prob avg_ann_wage education occupation `short occupation`   len
##   <dbl> <chr>   <dbl>        <dbl> <chr>     <chr>      <chr>              <dbl>
## 1   282 19-3011  0.43       112860 Master's… Economists Economists            10
## # ℹ 6 more variables: probability <dbl>, numbEmployed <dbl>,
## #   median_ann_wage <dbl>, employed_may2016 <chr>, average_ann_wage <dbl>,
## #   diff <dbl>

Groupby and Summarize

We can get simple summary statistics of our dataframe, like we did in pandas with describe (but it’s a bit more involved)

In R, the function uses the British spelling, summarise()

*Technically, it works with a z (summarize()), too.

This is the mean probability across the entire dataset, excluding any NA values:

summarise(jobs, prob = mean(prob, na.rm=TRUE))
## # A tibble: 1 × 1
##    prob
##   <dbl>
## 1 0.536

We can use summarise() in conjunction with group_by(), which is the same process as in pandas: split the data into the groups you need, then apply a statistic to each group with summarise().

We can create multiple new columns in one group_by/summarise step:

by_educ <- group_by(jobs, education)
educ_wage <-summarise(by_educ, av_wage_educ = mean(avg_ann_wage, na.rm=TRUE),
         count = n())
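The same split-apply-summarise pattern also works as one piped chain. A sketch on the built-in mtcars data (used here just because it ships with R):

```r
library(dplyr)

# Group cars by cylinder count, then compute a group mean and a group size:
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE),
            count   = n())
```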

Join

Merge dataframes together. Like pandas, you can join dataframes as left, right, inner or outer. There are similar combinations with some exceptions. You can check out the documentation here for more details.

the general format is:

join_type(df1, df2, by=c("key1_name", "key2_name"))

by_educ <- group_by(jobs, education) #Create another dataframe by education group
educ_emp <- summarise(by_educ, Nemp = sum(numbEmployed, na.rm=TRUE)) # That contains the number of employees by education group

python

pd.merge(educ_emp, educ_wage, on='education', how='left')

R

left_join(educ_emp,educ_wage, by= c("education"))
## # A tibble: 8 × 4
##   education                             Nemp av_wage_educ count
##   <chr>                                <dbl>        <dbl> <int>
## 1 Associate's degree                 2993610       56492.    44
## 2 Bachelor's degree                 25946820       80602.   155
## 3 Doctoral or professional degree    2424010      126743.    23
## 4 High school diploma or equivalent 49420870       44011.   307
## 5 Master's degree                    1926250       75966.    29
## 6 No formal educational credential  38642320       33031.    98
## 7 Postsecondary nondegree award      6823230       48555.    42
## 8 Some college, no degree            2981570       44516.     4

Breakout Group Exercises

I want you to start getting familiar with R and work through any troubleshooting issues you might have.

  1. set your working directory to where the jobs data is located
  2. import the data “job-automation-probability.csv”
  3. select the data columns short.occupation, education, prob, average_ann_wage, X_…code
  4. calculate the minimum probability by ‘education’
  5. create a new variable that calculates the difference between the minimum and maximum probability values by education and ensure that item 4 is in the same dataframe.

Next Week

What is “tidy” data?

Resources:

  • Vignette (from the tidyr package)

  • Original paper (Hadley Wickham, 2014 JSS) <- this author is the same as the book I mentioned earlier

Going into the tidyverse…